Building a Simple Linear Model to Better Understand Regression Methods
Get in Loser, We’re Fitting Lines to Data
```r
# fit a simple linear regression of exam score on hours studied
model <- lm(exam_score ~ hours_studied, data = df)

# store the fitted values and residuals alongside the raw data
df <-
  df |>
  mutate(
    fitted = model$fitted.values,
    residual = model$residuals
  )

df |>
  select(hours_studied, exam_score, fitted, residual) |>
  janitor::clean_names(case = "title") |>
  slice_sample(n = 10) |>
  gt() |>
  fmt_number(columns = c(Fitted, Residual), decimals = 2) |>
  cols_align(align = "center", columns = everything())
```

| Hours Studied | Exam Score | Fitted | Residual |
|---|---|---|---|
| 18 | 65 | 66.49 | −1.49 |
| 20 | 65 | 67.08 | −2.08 |
| 25 | 66 | 68.54 | −2.54 |
| 24 | 70 | 68.25 | 1.75 |
| 27 | 71 | 69.12 | 1.88 |
| 23 | 69 | 67.95 | 1.05 |
| 28 | 74 | 69.41 | 4.59 |
| 11 | 62 | 64.45 | −2.45 |
| 18 | 61 | 66.49 | −5.49 |
| 10 | 61 | 64.16 | −3.16 |
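The fitted and residual columns above can be reproduced by hand from the model's coefficients: each fitted value is just the intercept plus the slope times \(x\), and each residual is the observed value minus the fitted one. A minimal sketch, using simulated stand-in data (the data-generating values here are assumptions, not the exam dataset itself):

```r
set.seed(42)

# simulated stand-in for the exam data (hypothetical values)
hours_studied <- round(runif(100, 5, 30))
exam_score <- 61 + 0.29 * hours_studied + rnorm(100, sd = 2.5)

model <- lm(exam_score ~ hours_studied)
b <- coef(model)

# fitted = intercept + slope * x; residual = observed - fitted
fitted_by_hand <- b[1] + b[2] * hours_studied
residual_by_hand <- exam_score - fitted_by_hand

all.equal(unname(fitted_by_hand), unname(fitted(model)))    # TRUE
all.equal(unname(residual_by_hand), unname(resid(model)))   # TRUE
```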
```r
coefs <- model$coefficients

# shrink the slope slightly to construct a deliberately poor fit
df <-
  df |>
  mutate(
    fitted_under = coefs[1] + (coefs[2] - 0.1) * hours_studied,
    residual_under = exam_score - fitted_under
  )
```

```r
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_under),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
# inflate the slope slightly for a second deliberately poor fit
df <-
  df |>
  mutate(
    fitted_over = coefs[1] + (coefs[2] + 0.2) * hours_studied,
    residual_over = exam_score - fitted_over
  )
```

```r
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_over),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
set.seed(42)

# residuals (orange segments) around the underestimated slope
df |>
  slice_sample(n = 50) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted_under,
      xend = hours_studied, yend = fitted_under + residual_under
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_under),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
set.seed(42)

# residuals around the overestimated slope
df |>
  slice_sample(n = 50) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted_over,
      xend = hours_studied, yend = fitted_over + residual_over
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted_over),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
set.seed(42)

# residuals around the OLS fit
df |>
  slice_sample(n = 50) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

```r
# the same plot, now using the full dataset
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")
```

\[\min_{\hat\beta_0,\hat\beta_1} \text{RSS}, \quad \text{where } \text{RSS} = \sum_{i=1}^n \left( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right)^2\]
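The tilted lines make the objective concrete: any slope other than the OLS one produces a larger residual sum of squares. A quick check on simulated data (a sketch; the data-generating values are assumptions):

```r
set.seed(42)

# simulated stand-in data (hypothetical values)
hours_studied <- runif(100, 5, 30)
exam_score <- 61 + 0.29 * hours_studied + rnorm(100, sd = 2.5)

fit <- lm(exam_score ~ hours_studied)
b <- coef(fit)

# residual sum of squares for a given intercept/slope pair
rss <- function(b0, b1) sum((exam_score - (b0 + b1 * hours_studied))^2)

rss(b[1], b[2])        # the OLS line: the minimum
rss(b[1], b[2] - 0.1)  # underestimated slope: larger RSS
rss(b[1], b[2] + 0.2)  # overestimated slope: larger still
```

Because OLS residuals are uncorrelated with \(x\), perturbing the slope by \(\delta\) adds exactly \(\delta^2 \sum x_i^2\) to the RSS, so both tilted lines must do worse.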
Solving for \(\hat\beta_0\):
Differentiate: \(\frac{\partial}{\partial \hat{\beta}_0} \text{RSS} = -2 \sum_{i=1}^n \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)\)
Set the derivative to zero and solve: \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\)
Solving for \(\hat\beta_1\):
Differentiate: \(\frac{\partial}{\partial \hat{\beta}_1} \text{RSS} = -2 \sum_{i=1}^n x_i \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)\)
Set the derivative to zero and solve: \(\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\)
Our good friends \(\beta_1\), \(\beta_0\), and \(\epsilon\).
\[ Y = \underbrace{\vphantom{\beta_0} \overset{\color{#41B6E6}{\text{Intercept}}}{\color{#41B6E6}{\beta_0}} + \overset{\color{#005EB8}{\text{Slope}}}{\color{#005EB8}{\beta_1}}X \space \space}_{\text{Explained Variance}} + \overset{\mathstrut \color{#ED8B00}{\text{Error}}}{\underset{\text{Unexplained}}{\color{#ED8B00}{\epsilon}}} \]
\[\hat\beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]
\[\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \]
\[\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i \]
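The closed-form estimates can be verified directly against `lm()` (a sketch on simulated data; the generating parameters are assumptions):

```r
set.seed(42)

# simulated stand-in data (hypothetical values)
x <- runif(100, 5, 30)
y <- 61 + 0.29 * x + rnorm(100, sd = 2.5)

# slope and intercept from the closed-form solutions
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)
all.equal(c(b0, b1), unname(coef(fit)))  # TRUE
```

Note that `cov()` and `var()` both use an \(n - 1\) denominator, which cancels in the ratio, so this matches the sum formula exactly.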
What Happens When We Add More Predictors?
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon\]
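With more predictors, the same least-squares idea carries over in matrix form: stack the predictors (plus an intercept column) into a design matrix \(X\) and solve the normal equations \((X^\top X)\hat\beta = X^\top y\). A sketch with two simulated predictors (the data-generating values are assumptions):

```r
set.seed(42)

# simulated data with two predictors (hypothetical values)
n <- 100
x1 <- runif(n, 5, 30)
x2 <- rnorm(n)
y <- 61 + 0.29 * x1 - 1.5 * x2 + rnorm(n, sd = 2.5)

# design matrix with an intercept column
X <- cbind(1, x1, x2)

# solve the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), unname(coef(fit)))  # TRUE
```

In practice `lm()` uses a QR decomposition rather than forming \(X^\top X\) explicitly, which is numerically more stable, but the estimates agree.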
Where Next, Magic Math Man?
Contact:
Code & Slides:
Paul Johnson // Linear Regression from Scratch // Nov 28, 2024